The events of the devastating COVID-19 pandemic have been felt in all areas of human civilization. With the rapid development of mRNA vaccines to combat the spread of the virus, and reduce associated mortality and morbidity, societies have been slowly returning to a more normal way of life. However, this has not been uniform across countries. While in certain nations, vaccination campaigns have been largely successful, others have not fared as well and have faced a protracted pandemic. The efficacy of a nation’s COVID-19 vaccination campaign is thus imperative to the well-being of that nation’s populace, its global and domestic economy, and the prevention of the emergence of novel viral variants. Understanding the country-level factors that have led some nations to effectively vaccinate their population, and others to struggle with this process is, naturally, critical to tailoring international efforts to increase vaccination, and reduce the harm caused by the COVID-19 pandemic. Furthermore, the progression of Global Warming increases the likelihood of COVID-19-like pandemics in the future. Developing models to predict nations’ vaccination effectiveness is of the utmost importance, and could guide the focus of future global resource allocation.
Given the above, the goals of this project are: 1. To elucidate the country-level variables that are predictive of a nation’s vaccination success. 2. To develop models that can predict a nation’s vaccination success given social, political, and economic variables, as well as features of disease toll prior to vaccine availability.
To address the goals of our project, we have combined the following data sets to form an array of predictors:
The success of a nation’s vaccination campaign reflected in many variables, including number of vaccinated, days to vaccinating percentages of the population, and daily vaccinations. Describing, quantifying, and predicting such a multifaceted concept can be a challenge. To address this, we have developed a quantified measure of vaccination success that takes into account several such factors pertaining to national vaccination campaigns. We have termed this measure the Vaccination Success Index (VSI). Demonstrated in more detail further, the VSI was developed by applying Principal Component Analysis (PCA) to various measures of world nations’ vaccination campaigns. VSI serves as our primary outcome measure.
Our analyses have yielded several useful insights that may help inform global policy decisions as international resources are allocated to help struggling nations increase vaccination success.
Linear Regression models based on economic factors, disease burden factors, and political factors, as well as data-driven LASSO and Regsubsets models, have converged on the following variables as significantly predictive of vaccination success:
Furthermore, we have developed more robust models that predict VSI with varying degrees of accuracy (Linear Regression & LASSO, Trees & Random Forest, and Artificial Neural Networks). While predictive accuracy was moderate for each model individually, as well as an ensemble model that takes the weighted averages the predictions of all models, we have achieved a level of predictive ability that could help forecast vaccination success in subsequent COVID-19 vaccination campaigns and future pandemics.
We have developed a robust vaccination index capturing a country’s past vaccination success and our models highlighted several political, economic, social and health-related factors that have a crucial influence on a country’s ability to vaccinate its populace. However, this analysis comes with certain limitations. First, the dataset is constrained to countries for which a sufficient amount of data was present to compute both VSI and summary political, economic, social and health-related factors. While this was possible for a substantial number of countries (n = 99), some countries had to be dropped (n = 126 from the vaccination data, n = 42 from the world sustainability data) including large countries such as China, which did not sufficiently report on vaccination data. By obtaining more complete information, future studies may further improve the completeness of this dataset and thereby increase the accuracy of VSI predictions. Moreover, in addition to the factors found to influence VSI in this study, it is likely that other factors influence a country’s vaccination success, including more psychosocial and individual characteristics related to beliefs around vaccinations; however, it’s likely that some of these latent factors may be captured in our data as they relate to educational attainment, regime type, and life expectancy when interpreted as a broad indicator of a country’s healthcare and social safety net.
In conclusion, we were able to quantitatively define the success of each country’s vaccination campaign against COVID-19 and to identify key, modifiable factors that may contribute. While the modest predictive ability of our final model on unseen data is one of several limitations to consider in this work, the COVID-19 crisis has presented a multifaceted and evolving challenge to governments across the globe. Therefore, our analysis represents an important contribution to objectively assessing each country’s response and identifying factors that should be considered in addressing future global health crises.
Before starting to do more involved EDA, we will preprocess our data to ensure we have workable datasets to begin our exploration with. At the completion of preprocessing, we will have two longitudinal dataframes: one containing WSD predictors and one containing COVID case, death, and vaccination data. These dataframes will subsequently be summarized to capture the most important features of each variable by country, enabling us to merge the dataframes and model on these synthesized per-country metrics.
First, we process the World Sustainability Dataset. This dataset contains several missing elements labelled as empty instead of NA. Out first step is to replace these. Next, we remove columns that have less than 90% complete rate, and remove rows with NAs in these remaining columns.
After cleaning, we are left with 141 countries, which is a workable amount. Next, we rename some columns to make them readable, turn strings into factors where appropriate, and export our processed data to “WorldSustainabilityData_Processed.csv”.
We’ll do some exploratory plotting for a few key predictors across all years of reported data.
Further visualizations can be found in the Appendix under Additional EDA.
Next, we’ll try several different methods of summarizing this longitudinal WSD data into a single “timepoint”. We selected the latest data reported for each country (restricted to 2016 or later) and calculated: the average of each parameter over time (2001 to 2018); the average of the last 3 years of data (2016-2018); and the relative and absolute change in each parameter over time (2001-2018). In order to maximize completeness and recency of the dataset while minimizing colinearity, we decided to use each country’s latest data and data averaged over 17 years in our models below. Code for unused summary dataframes can be found in the appendix.
First we extract just the last year of reported data for each country, restricted to 2016 or later. As seen below, the majority of countries have 2018 data, though we are able to preserve an additional 23 countries by reaching back to 2016 data.
##
## 2016 2017 2018
## 6 17 109
As seen below, there is a great deal of variability in how many years each country has data reported in the WSD dataset:
##
## 1 3 4 5 6 8 9 10 11 12 13 14 15 16 17 18
## 1 2 4 1 3 2 5 3 6 6 8 1 13 5 21 60
This is a limitation that should be kept in mind when interpreting the results of our model and the predictive power of averaged variables in particular, since an averaged GDP for one country might reflect 17 years of data while the averaged GDP of another may only reflect 3 years. A full graphical representation of years reported by country can be found in the Appendix.
Next we’ll merge the latest data and averaged data into one wide wsd dataframe and remove any variables that are highly correlated:
## # A tibble: 6 × 58
## country Country.Code Continent WorldRegion Year_latest Electricity.Acc…
## <fct> <chr> <chr> <fct> <int> <dbl>
## 1 Iran IRN Asia Central an… 2018 100
## 2 Bosnia And Herzegovina BIH Europe Europe and… 2018 100
## 3 Cote D Ivoire CIV Africa Sub-Sahara… 2018 67.2
## 4 Guinea Bissau GNB Africa Sub-Sahara… 2018 28.5
## 5 Kyrgyzstan KGZ Asia Central an… 2017 100
## 6 Macedonia MKD Europe Europe and… 2018 99.8
## # … with 52 more variables: Adj.SavingNetCO2Damage_latest <dbl>,
## # Adj.SavingNaturalResourceDepletion_latest <dbl>,
## # Adj.SavingNetForestDepletion_latest <dbl>,
## # Adj.SavingParticulateEmissions_latest <dbl>,
## # CompulsoryEducationDurationYears_latest <int>,
## # ExportGoodsServices.GDP_latest <dbl>,
## # FinalConsumptionExpenditure.GDP_latest <dbl>, GDP.Current_latest <dbl>, …
Next, we will read in COVID cases/deaths and vaccination data sets. We remove countries with too many NAs and generate new variables indicating cases and deaths per hundred, to account for the total population in each country. Finally, we’ll merge the case, death, and vaccination info into a single dataframe.
## date country cumulative_total_cases daily_new_cases active_cases
## 1 2020-02-15 Afghanistan 0 NA 0
## 2 2020-02-16 Afghanistan 0 NA 0
## 3 2020-02-17 Afghanistan 0 NA 0
## 4 2020-02-18 Afghanistan 0 NA 0
## 5 2020-02-19 Afghanistan 0 NA 0
## 6 2020-02-20 Afghanistan 0 NA 0
## cumulative_total_deaths daily_new_deaths year month_num week month
## 1 0 NA 2020 2 7 Feb
## 2 0 NA 2020 2 7 Feb
## 3 0 NA 2020 2 7 Feb
## 4 0 NA 2020 2 7 Feb
## 5 0 NA 2020 2 8 Feb
## 6 0 NA 2020 2 8 Feb
## [1] country vaccination_events
## [3] people_vaccinated_per_hundred
## <0 rows> (or 0-length row.names)
| Name | covid |
| Number of rows | 170917 |
| Number of columns | 42 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| factor | 6 |
| numeric | 30 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| month | 0 | 1.00 | 3 | 3 | 0 | 12 | 0 |
| source_name | 168520 | 0.01 | 9 | 70 | 0 | 73 | 0 |
| source_website | 168520 | 0.01 | 16 | 248 | 0 | 103 | 0 |
| binned_people_vaccinated_per_hundred | 30685 | 0.82 | 6 | 9 | 0 | 10 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-01-22 | 2022-03-17 | 2021-03-02 | 786 |
| month_year | 0 | 1 | 2020-01-01 | 2022-03-01 | 2021-03-01 | 27 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| country | 0 | 1.00 | FALSE | 225 | Chi: 786, Uni: 762, Gam: 762, Lao: 762 |
| week | 0 | 1.00 | FALSE | 53 | 8: 4711, 9: 4711, 10: 4711, 11: 4486 |
| iso_code | 168520 | 0.01 | FALSE | 184 | ARG: 16, BEL: 16, BHR: 16, CAN: 16 |
| vaccines | 168520 | 0.01 | FALSE | 79 | Joh: 257, Oxf: 137, Oxf: 124, Joh: 108 |
| vaccination_events | 169784 | 0.01 | FALSE | 11 | no: 225, vac: 178, vac: 144, vac: 112 |
| vaccine_dose_events | 170545 | 0.00 | FALSE | 4 | no: 225, 1_d: 112, 2_d: 33, 3_d: 2 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| cumulative_total_cases | 0 | 1.00 | 6.14e+05 | 3.21e+06 | 0.00 | 8.80e+02 | 1.54e+04 | 1.91e+05 | 8.14e+07 | ▇▁▁▁▁ |
| daily_new_cases | 9680 | 0.94 | 2.89e+03 | 1.75e+04 | -322.00 | 1.00e+00 | 6.30e+01 | 7.45e+02 | 9.09e+05 | ▇▁▁▁▁ |
| active_cases | 9906 | 0.94 | 7.70e+04 | 7.17e+05 | -14321.00 | 6.30e+01 | 1.48e+03 | 1.54e+04 | 2.94e+07 | ▇▁▁▁▁ |
| cumulative_total_deaths | 6858 | 0.96 | 1.28e+04 | 5.65e+04 | 0.00 | 2.00e+01 | 2.74e+02 | 3.63e+03 | 9.96e+05 | ▇▁▁▁▁ |
| daily_new_deaths | 25854 | 0.85 | 4.20e+01 | 1.88e+02 | -39.00 | 0.00e+00 | 1.00e+00 | 1.30e+01 | 5.04e+03 | ▇▁▁▁▁ |
| year | 0 | 1.00 | 2.02e+03 | 6.50e-01 | 2020.00 | 2.02e+03 | 2.02e+03 | 2.02e+03 | 2.02e+03 | ▇▁▇▁▂ |
| month_num | 0 | 1.00 | 6.36e+00 | 3.47e+00 | 1.00 | 3.00e+00 | 6.00e+00 | 9.00e+00 | 1.20e+01 | ▇▅▅▅▇ |
| population | 27432 | 0.84 | 4.01e+07 | 1.52e+08 | 1614.00 | 1.27e+06 | 8.48e+06 | 2.84e+07 | 1.44e+09 | ▇▁▁▁▁ |
| daily_new_cases_per_hundred | 35645 | 0.79 | 2.00e-02 | 7.00e-02 | 0.00 | 0.00e+00 | 0.00e+00 | 1.00e-02 | 4.47e+00 | ▇▁▁▁▁ |
| daily_new_deaths_per_hundred | 46288 | 0.73 | 0.00e+00 | 0.00e+00 | 0.00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 3.00e-02 | ▇▁▁▁▁ |
| cumulative_total_cases_per_hundred | 27432 | 0.84 | 2.98e+00 | 5.57e+00 | 0.00 | 4.00e-02 | 4.00e-01 | 3.56e+00 | 6.98e+01 | ▇▁▁▁▁ |
| active_cases_per_hundred | 35052 | 0.79 | 3.20e-01 | 1.34e+00 | -0.03 | 0.00e+00 | 3.00e-02 | 2.20e-01 | 5.41e+01 | ▇▁▁▁▁ |
| cumulative_total_deaths_per_hundred | 31242 | 0.82 | 5.00e-02 | 8.00e-02 | 0.00 | 0.00e+00 | 1.00e-02 | 6.00e-02 | 6.30e-01 | ▇▁▁▁▁ |
| monthly_new_cases | 0 | 1.00 | 8.05e+04 | 4.77e+05 | 0.00 | 1.25e+02 | 1.94e+03 | 2.09e+04 | 2.04e+07 | ▇▁▁▁▁ |
| monthly_new_deaths | 0 | 1.00 | 1.07e+03 | 5.02e+03 | 0.00 | 0.00e+00 | 2.20e+01 | 2.85e+02 | 1.35e+05 | ▇▁▁▁▁ |
| monthly_new_cases_per_hundred | 0 | 1.00 | 4.10e-01 | 1.35e+00 | 0.00 | 0.00e+00 | 2.00e-02 | 2.50e-01 | 2.98e+01 | ▇▁▁▁▁ |
| monthly_new_deaths_per_hundred | 0 | 1.00 | 0.00e+00 | 1.00e-02 | 0.00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.10e-01 | ▇▁▁▁▁ |
| weekly_new_cases | 9165 | 0.95 | 2.88e+03 | 1.71e+04 | -28.14 | 5.25e+00 | 7.50e+01 | 7.62e+02 | 8.22e+05 | ▇▁▁▁▁ |
| weekly_new_deaths | 25360 | 0.85 | 4.18e+01 | 1.82e+02 | -3.71 | 0.00e+00 | 1.50e+00 | 1.29e+01 | 4.71e+03 | ▇▁▁▁▁ |
| weekly_new_cases_per_hundred | 0 | 1.00 | 9.00e-02 | 3.20e-01 | 0.00 | 0.00e+00 | 0.00e+00 | 5.00e-02 | 9.77e+00 | ▇▁▁▁▁ |
| weekly_new_deaths_per_hundred | 0 | 1.00 | 0.00e+00 | 0.00e+00 | 0.00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e-01 | ▇▁▁▁▁ |
| total_vaccinations | 168560 | 0.01 | 2.64e+07 | 1.68e+08 | 0.00 | 1.06e+05 | 1.08e+06 | 8.62e+06 | 3.20e+09 | ▇▁▁▁▁ |
| people_vaccinated | 168520 | 0.01 | 1.41e+07 | 8.22e+07 | 0.00 | 7.35e+04 | 7.49e+05 | 4.77e+06 | 1.27e+09 | ▇▁▁▁▁ |
| people_fully_vaccinated | 168920 | 0.01 | 1.26e+07 | 7.55e+07 | 1.00 | 7.80e+04 | 7.25e+05 | 4.37e+06 | 1.24e+09 | ▇▁▁▁▁ |
| daily_vaccinations_raw | 169966 | 0.01 | 2.81e+05 | 1.30e+06 | 0.00 | 3.79e+03 | 2.53e+04 | 1.27e+05 | 2.04e+07 | ▇▁▁▁▁ |
| daily_vaccinations | 168691 | 0.01 | 1.41e+05 | 7.56e+05 | 0.00 | 1.22e+03 | 9.71e+03 | 5.56e+04 | 1.74e+07 | ▇▁▁▁▁ |
| total_vaccinations_per_hundred | 168560 | 0.01 | 6.09e+01 | 6.46e+01 | 0.00 | 4.19e+00 | 3.53e+01 | 1.11e+02 | 3.36e+02 | ▇▃▂▁▁ |
| people_vaccinated_per_hundred | 168520 | 0.01 | 3.20e+01 | 2.96e+01 | 0.00 | 3.17e+00 | 2.41e+01 | 6.02e+01 | 1.25e+02 | ▇▃▃▂▁ |
| people_fully_vaccinated_per_hundred | 168920 | 0.01 | 2.98e+01 | 2.81e+01 | 0.00 | 3.29e+00 | 2.09e+01 | 5.59e+01 | 1.21e+02 | ▇▂▃▁▁ |
| daily_vaccinations_per_million | 168691 | 0.01 | 3.17e+03 | 4.03e+03 | 0.00 | 5.99e+02 | 2.03e+03 | 4.49e+03 | 9.29e+04 | ▇▁▁▁▁ |
We will start by visualizing some of the most relevant variables and patterns in the cases, deaths and vaccinations data.
There is quite a wide range of vaccination rates over time, with some countriesreaching high levels of vaccination at a very fast rate such as Gibraltar and Cuba, some countries showing more progressively increasing rates (e.g., European countries) and some countries having low vaccination rates up until today.
Let’s bin countries by their latest achieved vaccination rate (per hundred to account for their total populations):
As we can see, vaccination rates are currently sill very variables across the world.
Let’s now pick a few countries to examine the evolution of the number of cases with respect to important vaccination milestones, such as:
Keep in mind that some countries have not yet reached some of these milestones as of today and therefore, a different number of milestones may be present for each country represented below.
Based on the plots above, we can see that countries had variable vaccination start dates and that each has reach a different number of vaccination milestones. For instance, Cuba started vaccinating relatively late (May 2021) but had 90% of its population vaccinated by January 2022. In the US, vaccinations started in December 2020 and have reached a rate of 70% in December 2021. It is unclear from these graphs whether vaccinations have direct consequences on the number of cases and/or deaths.
In order to come up with meaningful summary predictors and dependent variables that are compatible with our WSD, we’ll next summaries our data on covid cases, deaths, and vaccinations. For our purposes in this project, we’ll treat cases and deaths prior to each country’s vaccine roll-out as independent variables (predictors) alongside the WSD. Vaccination data will be synthesized using PCA to create a the Vaccination Success Index (VSI), a composite score which we will use to indicate how well a country was able to vaccinate its citizens.
First, we create a dataframe with the latest COVID data from each country.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(missing_vac_cols)` instead of `missing_vac_cols` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2022-01-07" "2022-03-01" "2022-03-02" "2022-03-02" "2022-03-06" "2022-03-17"
Every country has some data from March 17, 2022, though some countries’ last reported vaccination data was from as early as 1.9^{4}.
Next, we’ll calculate predictors from COVID case and death rates prior to the day each country reported starting to vaccinate.
## `mutate_if()` ignored the following grouping variables:
## Column `country`
Finally we’ll merge these COVID predictors into a single dataframe and remove strongly colinear predictors.
Since defining a country’s success in administering vaccines is multifaceted, we’ll construct the VSI on several summary metrics in addition to final vaccination counts. For instance, it is likely meaningful to account for how long it took each country to reach certain milestones (such as first starting to administer vaccines, administering X doses to each citizen, or vaccinating X% of the population) as well its peak rate of vaccination (the maximum rate of daily or weekly vaccinations) in determining whether a country did a “good job” in administering COVID vaccines to its populace.
## `mutate_if()` ignored the following grouping variables:
## Column `country`
Now let’s merge all these vaccination dfs into one dataframe so we can create a VSI to predict!
First we’ll dump all the vaccine-related stuff into one big dataframe here:
Now we’ll pair it down to keep only key variables that will be meaningful in denoting a country’s vaccination success. Variables were selected to be meaningful and non-redundant, and to have high completion rates.
| Name | vac.sum |
| Number of rows | 225 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| country | 0 | 1 | FALSE | 225 | Uni: 1, Gam: 1, Lao: 1, Egy: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| days.to.start_rel | 47 | 0.79 | 79.3 | 49.5 | 0.00 | 39.0 | 76.5 | 103.5 | 321 | ▇▇▂▁▁ |
| days.to.10pct_fromstart | 82 | 0.64 | 140.0 | 80.5 | 3.00 | 91.0 | 114.0 | 178.0 | 362 | ▂▇▃▂▂ |
| daily.vac.per.mil_6moavg | 47 | 0.79 | 2989.6 | 3239.9 | 4.20 | 668.8 | 2341.9 | 4121.4 | 24370 | ▇▁▁▁▁ |
| people_vaccinated_per_hundred_last | 45 | 0.80 | 54.7 | 27.9 | 0.08 | 31.2 | 62.5 | 77.2 | 125 | ▆▅▇▇▁ |
| daily.vac.per.mil_max | 41 | 0.82 | 8838.4 | 8626.2 | 5.00 | 4170.5 | 8081.5 | 11118.2 | 92926 | ▇▁▁▁▁ |
| total_vaccinations_per_hundred_last | 53 | 0.76 | 122.9 | 75.2 | 0.09 | 54.5 | 127.9 | 188.1 | 336 | ▇▇▇▅▁ |
| people_fully_vaccinated_per_hundred_last | 53 | 0.76 | 49.7 | 27.6 | 0.07 | 23.2 | 56.8 | 72.7 | 121 | ▆▃▇▆▁ |
Next we’ll perform PCA to create a score to summarize these important vaccination metrics into a continuous rating, the VSI.
## PC1 PC2 PC3 PC4 PC5
## days.to.start_rel 0.260 -0.7951 0.400 0.362 -0.0654
## days.to.10pct_fromstart 0.353 0.3810 0.549 -0.128 -0.6362
## daily.vac.per.mil_6moavg -0.359 -0.4294 -0.309 -0.425 -0.6120
## people_vaccinated_per_hundred_last -0.421 0.0934 0.209 0.420 -0.0364
## daily.vac.per.mil_max -0.368 -0.1051 0.613 -0.579 0.3764
## total_vaccinations_per_hundred_last -0.428 0.0799 0.108 0.223 -0.2588
## people_fully_vaccinated_per_hundred_last -0.428 0.1099 0.113 0.332 -0.0798
## PC6 PC7
## days.to.start_rel 0.0523 0.0482
## days.to.10pct_fromstart -0.0869 0.0271
## daily.vac.per.mil_6moavg -0.1900 -0.0220
## people_vaccinated_per_hundred_last -0.5098 -0.5765
## daily.vac.per.mil_max 0.0172 0.0137
## total_vaccinations_per_hundred_last 0.8082 -0.1715
## people_fully_vaccinated_per_hundred_last -0.2008 0.7965
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.265 0.928 0.6772 0.5189 0.4821 0.17890 0.12157
## Proportion of Variance 0.733 0.123 0.0655 0.0385 0.0332 0.00457 0.00211
## Cumulative Proportion 0.733 0.856 0.9216 0.9601 0.9933 0.99789 1.00000
PC1 explains 73% of the variance and is therefore an excellent index indicating how well a country did at vaccinating their population. Based on the relative loadings of each variable, we’ll take the inverse of PC1 so that a higher metric indicates a more successful vaccination campaign. We are calling it the “Vaccination Success Index”, i.e. VSI.
Let’s see how the VSI loadings are distributed across countries:
| Name | vac.sum.with.VSI |
| Number of rows | 133 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| country | 0 | 1 | FALSE | 133 | Gam: 1, Lao: 1, Egy: 1, Kor: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| days.to.start_rel | 0 | 1 | 69.3 | 39.78 | 0.00 | 33.00 | 69.00 | 97.00 | 181.00 | ▆▆▇▂▁ |
| days.to.10pct_fromstart | 0 | 1 | 137.3 | 79.91 | 3.00 | 91.00 | 111.00 | 176.00 | 362.00 | ▂▇▂▂▁ |
| daily.vac.per.mil_6moavg | 0 | 1 | 2779.6 | 2231.06 | 73.00 | 924.00 | 2595.50 | 4012.50 | 14248.67 | ▇▅▁▁▁ |
| people_vaccinated_per_hundred_last | 0 | 1 | 58.0 | 24.33 | 11.43 | 37.51 | 65.14 | 78.40 | 95.03 | ▅▅▅▇▇ |
| daily.vac.per.mil_max | 0 | 1 | 8199.3 | 4227.80 | 800.00 | 5450.00 | 8312.00 | 10921.00 | 21207.00 | ▆▇▇▂▁ |
| total_vaccinations_per_hundred_last | 0 | 1 | 128.8 | 69.61 | 13.65 | 70.03 | 129.12 | 193.95 | 309.40 | ▇▇▆▇▁ |
| people_fully_vaccinated_per_hundred_last | 0 | 1 | 52.0 | 24.74 | 6.12 | 31.97 | 58.61 | 72.72 | 92.51 | ▆▃▅▇▆ |
| VSI | 0 | 1 | 0.0 | 2.27 | -4.56 | -1.83 | 0.41 | 1.84 | 4.99 | ▃▃▆▇▁ |
## 133 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 110 codes from the map weren't represented in your data
Based on the distribution plot, we can see that Somalia appears to be the country with the lowest VSI, and Cuba appears to have the highest VSI. The world map furthermore indicates that generally, North and South America, Europe, Australia and Japan are some regions with higher VSI scores, whereas regions including Africa, Russia and some Asian countries seem to have lower VSI scores. Note that countries without any color are not included in this dataset as they did not report sufficient vaccination data required for the computation of VSI; notable absences include China and the UK.
We’ll write our final, cleaned dataset (final.df) by joining our WSD predictors, prevaccination COVID case and death predictors, and the VSI.
Upon merging, we are left with data on 99 unique countries.
In this section, we are using socioeconomic factors and pre-vaccination COVID cases and deaths to predict Vaccination Success Index (VSI). Note that VSI corresponds the inverse signed PC1 scores and (based on the relative loadings of each raw vaccination value) indicates how well a country did at getting people vaccinated, i.e. a positive score indicates a good vaccination performance, while a negative score indicates a poor vaccination performance.
We’ll start by subsetting our dataset of 99 countries into data.train(85%) and data.val (15%). These splits were chosen because our dataset has a relatively small number of predictors and our most robust modeling methods have means of estimating testing error (Random Forest’s OOB error, Regsubsets Cp, Neural Networks’ internal data.val split). data.val will be reserved to validate our final ensemble model.
Let’s now see how various socioeconomic, sustainability and political factors may be associated with VSI. To develop the most accurate prediction, we will compare a range of methods going from supervised to unsupervised:
In the first approach, we are creating several linear models by selecting variables based on domain knowledge. Each model is focused on specific themes that may be of importance for predicting a country’s VSI score. The following themes will be covered: + Economic factors + “COVID burden” factors, i.e. how heavily a country suffered in terms of number of cases and deaths prior to vaccination start + Political and country development factors + Full model: we combine the most relevant factors into a comprehensive model
All linear models are compiled using backward selection. Summaries of the intermediate models are ommited for brevity.
Using domain knowledge, we explore the relationship between measures capturing the quality of a nation’s economy over the last reported 17 years and vaccination success.
Selected variables:
ExportGoodsServices.GDP_avg - part of the GDP resulting from exporting goods, averaged over 17 yearsFinalConsumptionExpenditure.GDP_avg - apart of the GDP resulting from consumption/expenses of the country, averaged over 17 yearsGDP.Current_avg - current overall GDP, averaged over 17 yearsGDP.PerCapita.Current_avg - current GDP per capita, averaged over 17 yearsConsumerPriceInflation_avg - how much a country was affected by price inflation, averaged over 17 yearsWe further refine our model by backward selection.
| Dependent variable: | ||||
| VSI | ||||
| (1) | (2) | (3) | (4) | |
| ExportGoodsServices.GDP_avg | 0.004 | |||
| (0.009) | ||||
| FinalConsumptionExpenditure.GDP_avg | -0.004 | -0.007 | ||
| (0.015) | (0.014) | |||
| GDP.Current_avg | 0.000 | 0.000 | 0.000 | |
| (0.000) | (0.000) | (0.000) | ||
| GDP.PerCapita.Current_avg | 0.0001*** | 0.0001*** | 0.0001*** | 0.0001*** |
| (0.00001) | (0.00001) | (0.00001) | (0.00001) | |
| ConsumerPriceInflation_avg | -0.086** | -0.088** | -0.083** | -0.083** |
| (0.042) | (0.041) | (0.040) | (0.040) | |
| Constant | -0.239 | 0.203 | -0.400 | -0.400 |
| (1.540) | (1.260) | (0.354) | (0.352) | |
| Observations | 84 | 84 | 84 | 84 |
| R2 | 0.402 | 0.400 | 0.398 | 0.397 |
| Adjusted R2 | 0.363 | 0.369 | 0.375 | 0.382 |
| Residual Std. Error | 1.670 (df = 78) | 1.660 (df = 79) | 1.650 (df = 80) | 1.640 (df = 81) |
| F Statistic | 10.500*** (df = 5; 78) | 13.100*** (df = 4; 79) | 17.600*** (df = 3; 80) | 26.700*** (df = 2; 81) |
| Note: | p<0.1; p<0.05; p<0.01 | |||
GDP per capita and consumer price inflation have a significant effect on Vaccination Success Index at the 0.05 level. More specifically, the model indicates that an increase in the per-capita GDP of a nation averaged over the last 17 years corresponds to an increase in the success of that country’s vaccination campaign. Conversely, an increase in consumer price inflation averaged over the last 17 years corresponds to a decrease in a country’s vaccination success.
Let’s now build a model focused on COVID related factors. Note that all variables correspond to COVID cases and deaths reported prior to a country’s given vaccination start date.
Selected variables:
monthly.new.cases.per.100_max - max monthly case rate reached prior to vaccination startmonthly.new.deaths.per.100_max - max monthly death rate reached prior to vaccination startcumulative_total_deaths_per_hundred_prevac - cumulative number of deaths on last report prior to vaccinations beginning in each country, normalized by populationAgain, we refine our model by backward selection.
| Dependent variable: | |||
| VSI | |||
| (1) | (2) | (3) | |
| monthly.new.cases.per.100_max | 1.360*** | 1.110*** | 1.260*** |
| (0.484) | (0.333) | (0.287) | |
| monthly.new.deaths.per.100_max | -25.600 | ||
| (36.300) | |||
| cumulative_total_deaths_per_hundred_prevac | 8.770 | 4.160 | |
| (8.060) | (4.690) | ||
| Constant | -0.942*** | -0.923*** | -0.856*** |
| (0.309) | (0.307) | (0.297) | |
| Observations | 84 | 84 | 84 |
| R2 | 0.204 | 0.200 | 0.192 |
| Adjusted R2 | 0.175 | 0.180 | 0.182 |
| Residual Std. Error | 1.900 (df = 80) | 1.890 (df = 81) | 1.890 (df = 82) |
| F Statistic | 6.850*** (df = 3; 80) | 10.100*** (df = 2; 81) | 19.500*** (df = 1; 82) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
The peak number of monthly new cases per hundred prior to vaccination start has a significant effect on VSI. Therefore, vaccination success seems influenced by a country’s peak number of cases prior to starting its vaccination campaign, where a higher number of cases leads to a better vaccination score. This could be explained by the fact that people may have a stronger vaccination incentive when COVID cases are more frequent around them.
Let’s build a model focused on political factors as well as indices reflecting the development of a country.
Selected variables:
WorldRegion - part of the world (more specific than continents)Electricity.Access_avg - reported electricity access, may influence the infrastructure available for vaccination, averaged over 17 yearsGDP.PerCapita.Current_avg - GDP per capita, averaged over 17 yearsUrbanPopulation.Prop_avg - how urban is the population, averaged over 17 yearsLifeExpenctancy_avg - life expectancy, averaged over 17 yearsCompulsoryEducationDurationYears_latest - latest level of mandated educationWomenInBusinessLawIndex_avg - proportion of women working in law and business fields, averaged over 17 yearsWorldBankIncomeClass_avg - categorical classifying as low, lower-middle, upper-middle, and high-income countries, averaged over 17 yearsRegimeType_avg - categorical classifying as Closed Autocracy, Electoral Autocracy, Electoral Democracy, Liberal DemocracyIndividualsUsingInternet_latest - latest internet usage## Anova Table (Type II tests)
##
## Response: VSI
## Sum Sq Df F value Pr(>F)
## WorldRegion 25.8 6 5.68 9.7e-05 ***
## Electricity.Access_avg 1.8 1 2.36 0.1299
## GDP.PerCapita.Current_avg 0.0 1 0.04 0.8394
## UrbanPopulation.Prop_avg 0.1 1 0.18 0.6757
## LifeExpenctancy_avg 14.9 1 19.72 3.8e-05 ***
## CompulsoryEducationDurationYears_latest 2.9 1 3.80 0.0559 .
## WomenInBusinessLawIndex_avg 7.7 1 10.12 0.0023 **
## WorldBankIncomeClass_avg 0.6 3 0.26 0.8557
## RegimeType_avg 8.1 3 3.56 0.0193 *
## IndividualsUsingInternet_latest 0.8 1 1.02 0.3167
## Residuals 46.2 61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We refine our model by backward selection.
## Anova Table (Type II tests)
##
## Response: VSI
## Sum Sq Df F value Pr(>F)
## WorldRegion 23.5 6 5.07 0.00023 ***
## LifeExpenctancy_avg 26.8 1 34.84 1.2e-07 ***
## WomenInBusinessLawIndex_avg 11.8 1 15.34 0.00021 ***
## RegimeType_avg 12.8 3 5.53 0.00185 **
## Residuals 53.2 69
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
World region and 17-year averages of life expectancy, women in business and law professions, and regime type have a significant effect on VSI. More specifically, it appears that compared to the reference world region (Central and Southern Asia), all other regions are associated with lower VSI except Eastern and South-Eastern Asia, who have a higher VSI. Higher life expectancy and more women employed in business or law are associated with higher VSI. Finally, compared to the reference regime type (closed autocracy, which includes countries such as Angola, Morocco, Jordan, etc.), other more liberal regime types are associated with lower VSI scores. This likely reflects the imposition of vaccination mandates - which often face opposition in countries with robust legal precedents for health autonomy - in more authoritarian countries.
We will now use all relevant variables found in the economic, COVID burden and political/development models and feeding them into one overall model, with the goal to capture a more diverse set of factors that may impact VSI the strongest.
Selected variables:
GDP.PerCapita.Current_avg - current GDP per capita, averaged over 17 yearscumulative_total_cases_per_hundred_prevac - last reported total number of cases prior to vaccination startWorldRegion - part of the world (more specific than continents)LifeExpenctancy_avg - life expectancy, averaged over 17 yearsWomenInBusinessLawIndex_avg - proportion of women working in law and business fields, averaged over 17 yearsRegimeType_avg - categorical classifying as Closed Autocracy, Electoral Autocracy, Electoral Democracy, Liberal Democracy## Anova Table (Type II tests)
##
## Response: VSI
## Sum Sq Df F value Pr(>F)
## GDP.PerCapita.Current_avg 0.9 1 1.15 0.28734
## ConsumerPriceInflation_avg 1.1 1 1.37 0.24601
## monthly.new.cases.per.100_max 0.1 1 0.07 0.79022
## WorldRegion 22.1 6 4.79 0.00042 ***
## LifeExpenctancy_avg 16.3 1 21.22 1.9e-05 ***
## WomenInBusinessLawIndex_avg 11.5 1 14.92 0.00026 ***
## RegimeType_avg 10.9 3 4.74 0.00470 **
## Residuals 50.7 66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we refine our model by backward selection.
##
## <table style="text-align:center"><caption><strong>Full Model</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>VSI</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">WorldRegionEastern and South-Eastern Asia</td><td>0.131</td></tr>
## <tr><td style="text-align:left"></td><td>(0.612)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">WorldRegionEurope and Northern America</td><td>-1.970<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.509)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">WorldRegionLatin America and Caribbean</td><td>-1.060<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.470)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">WorldRegionNorthern Africa and Western Asia</td><td>-1.500<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.487)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">WorldRegionOceania</td><td>-2.370<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(1.020)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">WorldRegionSub-Saharan Africa</td><td>-1.820<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.564)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">LifeExpenctancy_avg</td><td>0.166<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.028)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">WomenInBusinessLawIndex_avg</td><td>0.038<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.010)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">RegimeType_avgElectoral Autocracy</td><td>-1.330<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.463)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">RegimeType_avgElectoral Democracy</td><td>-1.410<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.527)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">RegimeType_avgLiberal Democracy</td><td>-0.482</td></tr>
## <tr><td style="text-align:left"></td><td>(0.605)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>-12.200<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(2.010)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>81</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.850</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.826</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>0.878 (df = 69)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>35.400<sup>***</sup> (df = 11; 69)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
When using all relevant variables found in prior models, it turns out that world region, life expectancy, the number of women employed in business and law, and regime type have a significant effect on VSI. Diagnostic plots show that the model sufficiently fulfills linear model assumptions of normality and homoscedasticity.
Let’s now try a slightly more data-driven approach, i.e. using LASSO to select the most relevant variables.
[1] 0.581 [1] “IndividualsUsingInternet_avg” “LifeExpenctancy_avg”
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(var.1se)` instead of `var.1se` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
| Dependent variable: | |
| VSI | |
| IndividualsUsingInternet_avg | 0.019** |
| (0.008) | |
| LifeExpenctancy_avg | 0.171*** |
| (0.023) | |
| Constant | -13.000*** |
| (1.470) | |
| Observations | 75 |
| R2 | 0.750 |
| Adjusted R2 | 0.743 |
| Residual Std. Error | 1.060 (df = 72) |
| F Statistic | 108.000*** (df = 2; 72) |
| Note: | p<0.1; p<0.05; p<0.01 |
LASSO results using the coef.1se on 75 complete countries suggest again that life expectancy is an important predictor of VSI. In addition, the LASSO approach selected the 17 year average number of individuals using internet. When feeding these into a linear model, both have a significant effect on VSI.
Next we’ll run regsubsets to optimize a model by minimizing Cp. This does return a warning that regsubsets still finds our data to be colinear, so we re-run correlation reduction with 95% cutoff on data.train to whether we can build a more optimal model. While this alternate model does better in modeling VSI (see Comparing Regsubsets Models in the Appendix), we felt this improvement in performance was not worth the lack of comparability with LASSO, RF and other models built on the full data.train set of predictors and thus proceed with the following model.
## Reordering variables and trying again:
## [1] "WorldRegionEastern and South-Eastern Asia"
## [2] "WorldBankIncomeClass_latestUpper-middle income"
## [3] "WorldBankIncomeClass_avgUpper-middle income"
## [4] "Govt.FinalConsumptionExpenditure.GDP_latest"
## [5] "WomenInBusinessLawIndex_avg"
## [6] "LifeExpenctancy_avg"
## [7] "cumulative_total_cases_prevac"
## [8] "weekly_new_deaths_per_hundred_prevac"
## Anova Table (Type II tests)
##
## Response: VSI
## Sum Sq Df F value Pr(>F)
## WorldRegion 24.7 6 5.66 9.5e-05 ***
## WorldBankIncomeClass_latest 7.4 3 3.38 0.0236 *
## WorldBankIncomeClass_avg 3.1 3 1.42 0.2451
## Govt.FinalConsumptionExpenditure.GDP_latest 5.5 1 7.53 0.0079 **
## WomenInBusinessLawIndex_avg 8.3 1 11.35 0.0013 **
## LifeExpenctancy_avg 23.0 1 31.61 4.6e-07 ***
## cumulative_total_cases_prevac 1.6 1 2.19 0.1442
## weekly_new_deaths_per_hundred_prevac 0.7 1 0.95 0.3326
## Residuals 45.9 63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let’s run some backward selection on this model:
## Anova Table (Type II tests)
##
## Response: VSI
## Sum Sq Df F value Pr(>F)
## WorldRegion 30.0 6 6.76 1.2e-05 ***
## WorldBankIncomeClass_latest 8.4 3 3.76 0.01463 *
## Govt.FinalConsumptionExpenditure.GDP_latest 4.6 1 6.26 0.01476 *
## WomenInBusinessLawIndex_avg 9.3 1 12.57 0.00071 ***
## LifeExpenctancy_avg 20.8 1 28.04 1.4e-06 ***
## Residuals 50.4 68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reducing to only significantly predictive variables, we find they are fairly similar to those identified by our full linear model. This is reassuring to have our data-driven model identify similar predictors to our hypothesis-driven models.
Now, we will try to use tree based methods, specifically Random Forest, to see if we can predict a country’s vaccination success based on social and economic measures prior to the pandemic, as well as COVID case data prior to the start of vaccination campaigns.
While an advantage of using a single tree is that we can see which variables are diving our prediction, Random Forest may provide better results at the cost of interpretability. Accordingly, now we will tune and train a Random Forest model. For those interested, an example of a single regression tree can be found in the Appendix under Regression Tree.
Let us tune the parameters mtry (number of variables to sample) and ntree (number of trees to use). First, we will tune ntree.
Based on out of bag error, it seems that 100 trees is sufficient. Now, let’s tune mtry.
An mtry of around 30 seems to provide good results, so we will use this as our parameter. Now, we train the final Random Forest model based on our tuned parameters.
The final Random Forest model, now that it’s been tuned, is performing with an Out-Of-Bag Testing error of 1.087. While our final testing error is still somewhat high, Random Forest can achieve a somewhat accurate estimate of a nation’s COVID vaccination success, given social and economic factors prior to the pandemic, and COVID case data prior to the start of vaccination campaigns.
Next, we’ll run neural nets. Though the data has high completion to begin with (only 1.53% missing in df.pred), neural nets cannot take NA inputs. Because the benefits of NN arise in large datasets and our data is relatively small to begin with, we’ll impute missing values prior to building, training and fitting our network.
Imputing, setting dummy vars & re-subsetting data.imp.train and data.imp.val:
## NRMSE PFC
## 1.89e-10 1.36e-01
Based on the very low normalized root mean squared error (NRMSE = 1.895^{-10}) of numeric variables and proportion of falsely classified entries (PFC = 0.136) estimated for categorical variables, we can feel confident in using these imputed values to train our neural network.
We’ll now proceed to build and compile our model:
## Loaded Tensorflow version 2.8.0
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Trainable
## ================================================================================
## batch_normalization (BatchNor (None, 73) 292 Y
## malization)
## dense_2 (Dense) (None, 32) 2368 Y
## dropout_1 (Dropout) (None, 32) 0 Y
## dense_1 (Dense) (None, 12) 396 Y
## dropout (Dropout) (None, 12) 0 Y
## dense (Dense) (None, 1) 13 Y
## ================================================================================
## Total params: 3,069
## Trainable params: 2,923
## Non-trainable params: 146
## ________________________________________________________________________________
Finally we fit, outputting MSE for both the training data and an internal validation sample (15% of data.train that’s been cleaned, imputed and normalized for NN input).
The lowest MSE calculated on the internal validation sample is 1.238 at the 60. Now that we’ve defined our model, we can refit it without the internal validation split and using 60 epochs.
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Trainable
## ================================================================================
## batch_normalization_1 (BatchN (None, 73) 292 Y
## ormalization)
## dense_5 (Dense) (None, 32) 2368 Y
## dropout_3 (Dropout) (None, 32) 0 Y
## dense_4 (Dense) (None, 12) 396 Y
## dropout_2 (Dropout) (None, 12) 0 Y
## dense_3 (Dense) (None, 1) 13 Y
## ================================================================================
## Total params: 3,069
## Trainable params: 2,923
## Non-trainable params: 146
## ________________________________________________________________________________
The MSE of this model is now 1.556, slightly larger than the training loss on the original (MSE at epoch 65 = 1.259).
Now we’ll attempt to create the best model by bagging those created above, averaging the predicted VSI values from each model. Each country in the training data’s VSI is estimated by each model and scores are compiled in the dataframe below. An ensemble score is then estimated by averaging across the models.
## VSI full.lm lasso rf regsub V1 en.pred
## 9 1.839 1.839 2.466 2.024 1.839 2.099 2.0533
## 74 1.136 1.066 0.269 NA 0.583 1.192 0.7775
## 76 2.861 1.866 1.507 2.311 1.620 1.423 1.7455
## 55 0.141 1.432 -2.027 NA 0.209 0.352 -0.0086
## 72 -1.006 -1.015 -1.676 -0.893 -1.162 -1.000 -1.1492
## 54 1.285 0.018 0.736 1.450 0.179 0.351 0.5467
## [1] 0.458
Hurrah! Our training MSE from our ensemble method, 0.458, is actually pretty good! We can reduce this even further by weighting the contributions of each model based on their individual training MSEs, displayed below:
## model train.mse
## 1 full.lm 0.626
## 2 LASSO 1.102
## 3 RandomForest 0.171
## 4 RegSub 0.566
## 5 NN 0.684
## 6 full.ensemble 0.458
As we can see, the RandomForest MSE is lowest while LASSO’s MSE is highest. By weighting the contributions of each model to our final prediction accordingly, we obtain a training MSE of 0.324.
We’ll conclude by running our weighted ensemble method on the validation data we split out at the beginning to see how well this model predicts VSI scores based on data from unseen countries.
## VSI full.lm lasso rf regsub V1 weighted.pred
## 1 0.608 -0.529 0.0579 -0.445 0.208 -0.376 -0.289
## 2 -2.971 0.751 -0.8299 -1.080 1.016 -0.567 -0.326
## 6 -2.909 -0.545 0.4464 -0.550 0.187 -0.385 -0.314
## 12 2.211 0.402 1.2984 1.310 0.235 1.150 0.952
## 36 -4.089 -3.051 -2.6254 -3.328 -3.190 -2.528 -3.091
## 43 -0.194 0.126 -0.9307 -0.473 0.154 0.549 -0.151
## [1] 2.04
Unfortunately the MSE from applying our weighted ensemble method to unseen validation data (2.04) is much larger than that of our training data (0.324). This suggests that the ensemble model and/or one of the underlying models bagged to create it suffered from overfitting.
## `geom_smooth()` using formula 'y ~ x'
In summary, we were able to create a variety of models predicting our VSI score. The MSE of each model’s predictions on the training data is shown here:
## model train.mse
## 1 full.lm 0.626470711796047
## 2 LASSO 1.10247014138825
## 3 RandomForest 0.170612648153805
## 4 RegSub 0.566104279987661
## 5 NN 0.683601748727079
## 6 full.ensemble 0.458311711054015
## 7 weighted.ensemble 0.324177892641474
Given that our dataset was relatively small, a training MSE of around 0.5 in predicting VSI (which ranges almost 10 pts from -4.353 to 3.165) may be considered very good performance; this was notably achieved by hypothesis-driven linear modeling, Random Forests, Regsubsets, and our weighted and unweighted ensemble models. While our final ensemble model did not perform as well on our validation data - suggesting that it and/or one of the underlying models were overfitted - we hope the methodology explored here can be improved with application to similar, more complete datasets in the future.
Beyond prediction, the linear models provide a number of insights into country-level variables that are significantly related to vaccination success. Each factor, as well as policy recommendations to address it, are noted above in the Executive Summary. However, it is worth reiterating some key findings here.
Firstly, both LASSO and Regsubsets identified similar sets of predictors to those in our hypothesis-driven thematic linear models. This is reassuring to have our data-driven model identify similar predictors to our hypothesis-driven models and validates our choices in predictors. Our hypothesis-driven linear model also had very good MSE in training relative to even purely data-driven models.
Similarly, every one of our interpretable models (full linear model, LASSO, regsubsets and decision tree) identified averaged life expectancy as a significant predictor of VSI. Life expectancy likely captures a wide range of latent predictors including healthcare access, infrastructure, public health education, and potentially attitudes and social supports that may be beneficial both in physical health outcomes and encouraging vaccination compliance.
The relationship between life expectancy and VSI is plotted here:
## `geom_smooth()` using formula 'y ~ x'
Furthermore, none of the COVID-19 death or case metrics prior to vaccination onset were found to be predictive when other, non-COVID variables were included. This is surprising, considering that one would expect COVID cases and mortality to have a strong affect on a country’s ability, national investment, and individual willingness to undertake large-scale vaccination campaigns. Our analysis is significant in the fact that it indicates non-disease specific factors are actually more relevant to vaccination against COVID-19 than the burden of the disease itself.
We hope that these analyses may be useful in shaping future discussions about the varying strengths of each nation’s COVID-19 vaccination campaigns and in global response to health crises more broadly.
Detailed graph of the number of years with data reported for each country:
Plotting change in key predictors over the 17-year data period.
Below is code to generate dataframes that where in the end not used for modeling - may be useful for EDA or post-hoc analyses:
Here’s code to subdivide the train/test dfs based on what type of WSD predictors they include (i.e. so you can easily train a model only on the WSD 17-yr averages, plus static measures and covid cases/deaths).
Code initially used to identify discrepancies in country names between COVID and WSD dataframes.
Single Regression Tree to Predict VSI. Created for illustrative purposes.